The following documents my progress in the course Introduction to Open Data Science. The course includes basic R, RStudio, RMarkdown and GitHub usage, as well as several more advanced topics. I am saving all the weekly assignments and the course diary in a GitHub repository available here.
## [1] "Mon Nov 23 16:27:06 2020"
I have some skills in statistical methods and the use of R, but I am unfamiliar with GitHub (except as a place for plundering code). I have done most of my work with longitudinal register data and parametric methods, and, based on the course description, I hope that the course will provide me new knowledge on:
But that’s probably enough for now; more to follow during the coming weeks.
In this week’s assignment I am looking into the association between students’ learning approaches and their learning outcomes. The assignment is based on data collected during a statistics course. The data consist of a questionnaire of 32 items measuring three different dimensions of learning. In addition, there is information available on participants’ global attitude towards statistics, gender, age and course exam points. More information on the data can be found here. The data I am using in this assignment is a modified version of the original data (see below).
#Load packages
library(tidyverse)
library(GGally)
#Read in the data
setwd("~/IODS-project/data")
learning_2014 <- read.table("learning_2014.txt")
str(learning_2014)
## 'data.frame': 166 obs. of 7 variables:
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
head(learning_2014)
## age gender attitude points deep stra surf
## 1 53 F 3.7 25 3.583333 3.375 2.583333
## 2 55 M 3.1 12 2.916667 2.750 3.166667
## 3 49 F 2.5 24 3.500000 3.625 2.250000
## 4 53 M 3.5 10 3.500000 3.125 2.250000
## 5 49 M 3.7 22 3.666667 3.625 2.833333
## 6 38 F 3.8 21 4.750000 3.625 2.416667
summary(learning_2014)
## age gender attitude points
## Min. :17.00 Length:166 Min. :1.400 Min. : 7.00
## 1st Qu.:21.00 Class :character 1st Qu.:2.600 1st Qu.:19.00
## Median :22.00 Mode :character Median :3.200 Median :23.00
## Mean :25.51 Mean :3.143 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:27.75
## Max. :55.00 Max. :5.000 Max. :33.00
## deep stra surf
## Min. :1.583 Min. :1.250 Min. :1.583
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417
## Median :3.667 Median :3.188 Median :2.833
## Mean :3.680 Mean :3.121 Mean :2.787
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :4.917 Max. :5.000 Max. :4.333
The modified data used in this assignment contain 7 variables and 166 students. Participants’ age (17-55 years, mean 25.5 years), gender (male/female) and course exam score (7-33 points, mean 22.7) are recorded in the corresponding variables. The variables deep, stra and surf measure the dimensions of learning approaches and study skills: deep, strategic and surface learning. I have transformed these variables from the original questionnaire by taking the mean of all the questions measuring a specific dimension, so their possible values range between 1 and 5. The variable attitude measures the student’s global attitude towards statistics. It is a sum variable of 10 items, which I have scaled back to the original 1-5 scale of the questions by taking the mean of the items.
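The transformation described above boils down to taking row means over the items of each dimension. A minimal sketch with a toy data frame (the item names here are hypothetical, not the ones in the original questionnaire):

```r
# Each dimension score is the mean of its items,
# which keeps the original 1-5 answer scale
toy <- data.frame(D1 = c(4, 2), D2 = c(5, 3), D3 = c(3, 1))
toy$deep <- rowMeans(toy[, c("D1", "D2", "D3")])
toy$deep
```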
Let’s have a look at the data with the function ggpairs:
#Lower adds the regression lines,
#upper scales the text in the correlation boxes
ggpairs(learning_2014,
        aes(col = gender, alpha = 0.5),
        lower = list(continuous = wrap("smooth")),
        upper = list(continuous = wrap("cor", size = 2)))
Figure 1: Graphical overview of the Learning 2014 data
Figure 1 summarizes the data. On the diagonal of the matrix of plots, there are density plots for the continuous variables and bar plots for the discrete gender variable. The diagonal thus shows the distribution of each variable. In the upper triangle, there is a correlation matrix for the continuous variables and a boxplot for each variable by gender. The lower triangle contains scatter plots for the continuous variables and histograms of these variables by gender.
There are more females in the data, and they are slightly younger than the males. There are indications of slight gender differences, especially in the attitude and surface learning variables (see the density plots and boxplots). However, since I am mostly interested in the association between learning approaches and outcomes (exam score), I will discard gender differences to make the plot more readable (Figure 2).
ggpairs(learning_2014 %>%
          select(c("attitude", "surf", "deep", "stra", "points")),
        lower = list(continuous = wrap("smooth")),
        upper = list(continuous = wrap("cor", size = 2)))
Figure 2: Graphical overview of the Learning 2014 data
Figure 2 indicates that the strongest predictor of a student’s exam score is the student’s attitude towards statistics. The association is linear and positive: the higher the attitude, the higher the exam score. Of the learning approaches, the strategic dimension seems to be positively and surface learning negatively associated with exam score. Based on the correlation coefficients, both of these associations are relatively weak. Deep learning seems not to be correlated with exam score. Deep learning also seems to be distributed less evenly than the other dimensions: fewer students have low scores on this dimension.
Based on the investigation above, I assume that the exam score is associated with attitude, strategic and surface learning. I will examine this association by fitting a linear regression model:
model1 <- lm(points ~ attitude + stra + surf,
             data = learning_2014)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The model output confirms the reasoning based on the plots above. Attitude has a positive association with exam score, as does a higher score in strategic learning. A high score on surface learning decreases the exam score. The multiple R-squared suggests that around 21% of the variation in exam score is explained by these three variables. However, the coefficients of strategic and surface learning are not statistically significantly different from 0 at the conventional 95% confidence level, i.e. the p-value of the test that the coefficient equals 0 is larger than 0.05. Thus, I exclude these variables, and my final model includes only attitude as an explanatory variable.
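Another way to justify dropping the two variables is to test their joint contribution with a nested-model F-test; a sketch, assuming model1 fitted above is still in the workspace:

```r
# Refit without stra and surf and compare the nested models;
# a large p-value supports the simpler model
reduced <- update(model1, . ~ attitude)
anova(reduced, model1)
```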
model2 <- lm(points ~ attitude,
             data = learning_2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Based on the summary of the model, attitude is statistically significantly associated with exam score. On average, a one-point increase in attitude increases the exam score by around 3.5 points. The intercept estimates the points of an individual with an attitude score of 0 (around 12 points). The multiple R-squared shows that 19% of the variation in exam score is explained by differences in the global attitude towards statistics.
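As a quick sanity check, the interpretation can be verified by plugging the reported coefficients in by hand (using the rounded estimates from the summary above):

```r
# Predicted exam points at attitude 3 vs 4,
# using the coefficients reported above
b0 <- 11.6372   # intercept
b1 <- 3.5255    # attitude slope
pred <- b0 + b1 * c(3, 4)
diff(pred)      # the difference equals the slope, ~3.5 points
```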
Figure 3 shows a basic scatter plot of exam score against attitude, with the fitted regression line, in the upper left corner. In the other panels, I have plotted the standard diagnostic plots: Residuals vs. Fitted values (upper right), Normal Q-Q plot (lower left) and Residuals vs. Leverage (lower right).
par(mfrow = c(2, 2))
plot(y = learning_2014$points,
     x = learning_2014$attitude)
abline(model2)
#Diagnostics of the final model
plot(model2, which = c(1, 2, 5))
Figure 3: Diagnostic plots
The basic assumption of a linear regression model is that the relationship between the variables is linear. From the scatter plot, this assumption seems to hold, although there are some outlying observations with low test scores and high attitude scores. The linearity assumption is also (mostly) confirmed by the residuals vs. fitted values plot. If the linearity assumption did not hold, we would expect to see a systematic pattern in the residuals. There seem to be several outliers with high residuals in this case.
The second assumption of the model is homoscedasticity of the residuals: the variance of the residuals is equal regardless of the values of the explanatory variables. From the residuals vs. fitted values plot, this seems to be the case; there is no distinct pattern between the residuals and the fitted values. The third assumption is that the residuals are normally distributed. According to the Q-Q plot, this assumption seems to hold, as the residuals fit the theoretical (normal) distribution relatively well.
The fourth assumption of the model is independence between observations. This cannot be examined through diagnostic plots. We can assume that the attitudes and exam scores should be independent of each other for the most part. There seems to be a small cluster of individuals with a high attitude score and a low exam score, which could be speculated to be a group of friends who like statistics but whose previous night was a bit too long. This would be a violation of the independence assumption. The residuals vs. leverage plot suggests that there are no observations with a major impact on the regression results, which is reassuring in terms of the model’s sensitivity to these outliers. All in all, the model seems to be a relatively good one. Whether it is really useful is another thing.
This week’s assignment focuses on the risk factors of high alcohol use among secondary school students in Portugal. I am using data collected from two schools (Cortez and Silva 2008). The data include information on weekly alcohol use and grades in Mathematics and Portuguese, as well as several background variables, such as parental education. I will start the analysis by graphically exploring the dataset, then move on to logistic regression to describe the relevant associations, and finally assess model validity with cross-validation exercises.
#Load packages
library(tidyverse)
library(GGally)
library(patchwork)
library(openxlsx)
#Read in the data
setwd("~/IODS-project/data")
alc_use <- read.xlsx("alc_use.xlsx")
I have modified the data by combining weekday and weekend alcohol use and constructing a binary variable indicating high alcohol use (average daily alcohol consumption > 2). Based on previous knowledge of the topic, I assume that males, students with less educated parents, students with problematic family relations and students going out a lot are likely to be at increased risk of high alcohol use.
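The combination step was done in a separate data wrangling script; in essence it averages the workday (Dalc) and weekend (Walc) consumption items of the original data (both on a 1-5 scale). A minimal sketch with toy values:

```r
# alc_use = average of workday and weekend consumption;
# high_use = TRUE when the average exceeds 2
d <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 2, 5))
d$alc_use  <- (d$Dalc + d$Walc) / 2
d$high_use <- d$alc_use > 2
d
```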
First, let’s look at parental education. Education is available for both parents, but I will use the highest education of either parent.
#Generate variable paredu
#(highest education of either parent)
alc_use$paredu <- ifelse(alc_use$Fedu > alc_use$Medu,
                         alc_use$Fedu, alc_use$Medu)
#Modify into factor
alc_use$paredu <- as.factor(alc_use$paredu)
#Plot means of high use by parental education
alc_use %>%
  group_by(paredu) %>%
  summarise(mean_high_use = mean(high_use),
            n = n()) %>%
  ggplot(aes(x = paredu, y = mean_high_use,
             size = n, col = paredu)) +
  geom_point() +
  theme_minimal()
Means of high alcohol use by parental education. Size represents the number of observations
prop.table(table(alc_use$paredu))
##
## 1 2 3 4
## 0.09189189 0.23783784 0.25675676 0.41351351
prop.table(table(alc_use$paredu,alc_use$high_use),1)
##
## FALSE TRUE
## 1 0.6764706 0.3235294
## 2 0.7500000 0.2500000
## 3 0.6526316 0.3473684
## 4 0.7058824 0.2941176
chisq.test(alc_use$paredu,alc_use$high_use)
##
## Pearson's Chi-squared test
##
## data: alc_use$paredu and alc_use$high_use
## X-squared = 2.1775, df = 3, p-value = 0.5364
From the figure we see that there is no clear pattern between parental education and high alcohol use. The highest proportion of high alcohol use is at level 3 and the lowest at level 2. The Chi-squared test also indicates that there is no statistically significant relationship between these two variables. Moreover, low parental education is relatively uncommon among the students (around 9%).
Second, I’ll look at the relationship between the quality of family relations and high alcohol use.
alc_use %>%
ggplot(aes(x=high_use,
y=famrel)) +
geom_violin() +
geom_jitter() +
theme_minimal()
Quality of family relations and high alcohol consumption. Violin plot
prop.table(table(alc_use$famrel))
##
## 1 2 3 4 5
## 0.02162162 0.04864865 0.17297297 0.48648649 0.27027027
The violin plot indicates that the distribution of the quality of family relations is rather skewed towards the high values (good relations). In addition, no large differences seem to be present between high and low alcohol use.
However, looking at the distributions more closely, around 80% of those without high alcohol consumption have responded that their family relations are very good or excellent (4 or 5). Among those with high consumption, these high values are slightly less common. Hence, I transform the variable into a binary one where 1 = very good relations, 0 = lower.
prop.table(table(alc_use$famrel,alc_use$high_use),2)
##
## FALSE TRUE
## 1 0.02316602 0.01801802
## 2 0.03474903 0.08108108
## 3 0.15057915 0.22522523
## 4 0.49420849 0.46846847
## 5 0.29729730 0.20720721
alc_use$famrel_b <- ifelse(alc_use$famrel > 3, 1, 0)
table(alc_use$famrel_b,alc_use$high_use)
##
## FALSE TRUE
## 0 54 36
## 1 205 75
prop.table(table(alc_use$famrel_b,alc_use$high_use),1)
##
## FALSE TRUE
## 0 0.6000000 0.4000000
## 1 0.7321429 0.2678571
chisq.test(alc_use$famrel_b,alc_use$high_use)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: alc_use$famrel_b and alc_use$high_use
## X-squared = 5.0515, df = 1, p-value = 0.0246
Within this binary indicator, there is a statistically significant relationship (see the Chi-squared test) between lower-quality family relations and high alcohol use. The probability of high use among those with lower-quality relations is around 13 percentage points higher (40% vs. 27%).
Finally, let’s look at the last two explanatory variables: sex and going out with friends.
ggplot(data=alc_use,
aes(x=high_use,y=goout))+
geom_violin()
prop.table(table(alc_use$goout,alc_use$high_use),2)
##
## FALSE TRUE
## 1 0.07335907 0.02702703
## 2 0.31660232 0.13513514
## 3 0.37451737 0.20720721
## 4 0.15444015 0.34234234
## 5 0.08108108 0.28828829
table(alc_use$sex,alc_use$high_use)
##
## FALSE TRUE
## F 154 41
## M 105 70
prop.table(table(alc_use$sex,alc_use$high_use),1)
##
## FALSE TRUE
## F 0.7897436 0.2102564
## M 0.6000000 0.4000000
chisq.test(alc_use$sex,alc_use$high_use)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: alc_use$sex and alc_use$high_use
## X-squared = 14.921, df = 1, p-value = 0.0001121
Of these variables, it is clear that going out has a positive association with high alcohol use: among high users, the distribution of going out is shifted towards the higher values (see the violin plot as well as the table of percentages). In addition, being male increases the risk of high use: 40% of males and 21% of females are high alcohol users.
My initial model consists of parental education, binary family relations indicator, continuous going out variable and sex of the student:
model_1 <- glm(high_use ~ paredu + famrel_b + sex + goout,
               data = alc_use)
summary(model_1)
##
## Call:
## glm(formula = high_use ~ paredu + famrel_b + sex + goout, data = alc_use)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6052 -0.3183 -0.1386 0.3627 1.0720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03823 0.09877 -0.387 0.69890
## paredu2 -0.12227 0.08427 -1.451 0.14764
## paredu3 -0.01163 0.08331 -0.140 0.88905
## paredu4 -0.08778 0.07931 -1.107 0.26910
## famrel_b -0.16560 0.05059 -3.273 0.00117 **
## sexM 0.17968 0.04365 4.116 4.77e-05 ***
## goout 0.14342 0.01925 7.450 6.89e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1728196)
##
## Null deviance: 77.700 on 369 degrees of freedom
## Residual deviance: 62.734 on 363 degrees of freedom
## AIC: 409.41
##
## Number of Fisher Scoring iterations: 2
The summary of the model indicates that all higher categories of parental education are associated with lower odds of high alcohol use when compared to the lowest level of parental education. However, the differences are not statistically significant. Being male and going out a lot increase the odds of high alcohol use, and having good family relations decreases the odds. Based on this initial model, I will leave parental education out of my final model.
model_2 <- glm(high_use ~ famrel_b + sex + goout,
               data = alc_use)
OR <- exp(coef(model_2))
CI <- exp(confint(model_2))
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.8979596 0.7761920 1.0388300
## famrel_b 0.8533955 0.7728747 0.9423051
## sexM 1.1888255 1.0916995 1.2945926
## goout 1.1537940 1.1110075 1.1982283
In my final model, a one-step increase in going out increases the odds of high alcohol use by 15% (OR = 1.15, 95% CI: 1.11, 1.20). Being male increases the odds of high alcohol use by 19% (OR = 1.19, 95% CI: 1.09, 1.29), and having good family relations decreases the odds by 15% (OR = 0.85, 95% CI: 0.77, 0.94). Since the confidence intervals do not include 1, all these associations are statistically significant at the 95% confidence level.
probs <- predict(model_2, type = "response")
alc_use <- mutate(alc_use, probability = probs)
alc_use <- mutate(alc_use, predicted_high = probability > 0.5)
#table
table(high_use = alc_use$high_use,
      prediction = alc_use$predicted_high)
## prediction
## high_use FALSE TRUE
## FALSE 253 6
## TRUE 76 35
#Proportions
table(
high_use = alc_use$high_use,
prediction = alc_use$predicted_high) %>%
prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68378378 0.01621622 0.70000000
## TRUE 0.20540541 0.09459459 0.30000000
## Sum 0.88918919 0.11081081 1.00000000
#Graphical assessment
ggplot(alc_use,
aes(x = probability,
y = high_use,
col=predicted_high)) +
geom_point(size=2)
# loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
#average number of wrong predictions
loss_func(class = alc_use$high_use,
prob = alc_use$probability)
## [1] 0.2216216
In total, around 22% of individuals are inaccurately classified. The model predicts that only 11% would be high users, whereas in the observed data 30% were identified as high users. On the other hand, almost all of the high use predictions were true high users. The estimates therefore seem conservative, which is reassuring if any inference is to be made. Erroneous predictions occurred mostly at lower levels of predicted probability (see the figure).
I am not sure how to compare this to a simple guessing strategy, but I guess(!) that if I simply tried to guess one person’s alcohol consumption, there would be a 50% chance I am wrong. Hence, I think the model is somewhat better than just guessing.
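One concrete baseline is to always predict the majority class (no high use). With the loss function defined above, this amounts to setting all predicted probabilities to 0 (a sketch, assuming alc_use and loss_func from above are in the workspace):

```r
# Error of the constant "never a high user" guess:
# it equals the observed share of high users (~30%),
# which the model's ~22% error beats
loss_func(class = alc_use$high_use, prob = 0)
```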
library(boot)
## Warning: package 'boot' was built under R version 4.0.3
cross_validation <-
cv.glm(data = alc_use,
cost = loss_func,
glmfit = model_2, K = 10)
cross_validation$delta[1]
## [1] 0.227027
My model does seem to have better test performance when compared to the model in DataCamp: my average prediction error is around 23-24% (it varies between cross-validation runs), whereas in the DataCamp model it was 26%.
Finally, I will try including different sets of predictors and see what happens to the training and testing errors. I will start with a large model with 10 predictors and move towards 1.
m_1_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout + paredu +
                            health + absences + romantic + studytime + school")
m_2_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout + paredu +
                            health + absences + romantic + school")
m_3_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout + paredu +
                            health + absences + school")
m_4_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout + paredu +
                            health + school")
m_5_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout + paredu +
                            school")
m_6_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout + school")
m_7_formula  <- as.formula("high_use ~ sex + age + famrel_b + goout")
m_8_formula  <- as.formula("high_use ~ sex + age + famrel_b")
m_9_formula  <- as.formula("high_use ~ sex + age")
m_10_formula <- as.formula("high_use ~ sex")
library(purrr)
#is_formula(m_10_formula)
model_vector <- c(
m_1_formula,
m_2_formula,
m_3_formula,
m_4_formula,
m_5_formula,
m_6_formula,
m_7_formula,
m_8_formula,
m_9_formula,
m_10_formula)
#is_formula(model_vector[[1]])
#Row 1: training error, row 2: testing error
error_mat <- matrix(NA, nrow = 2, ncol = 10)
for (i in 1:10) {
  alc_temp <- alc_use
  current_model <- model_vector[[i]]
  current_glm <- glm(current_model,
                     family = 'binomial',
                     data = alc_temp)
  probs <- predict(current_glm, type = "response")
  alc_temp <- mutate(alc_temp, probability = probs)
  alc_temp <- mutate(alc_temp, predicted_high = probability > 0.5)
  #Training error (average number of wrong predictions)
  error_mat[1, i] <- loss_func(class = alc_temp$high_use,
                               prob = alc_temp$probability)
  #Testing error from 10-fold cross-validation
  cross_validation <- cv.glm(data = alc_temp, cost = loss_func,
                             glmfit = current_glm, K = 10)
  error_mat[2, i] <- cross_validation$delta[1]
}
labels <- c("10","9","8","7","6","5","4","3","2","1")
plot(error_mat[1,], type = 'l', col = 'blue', lwd = 2, xaxt = 'n',
     xlab = "Number of predictors", ylab = "Size of error")
lines(error_mat[2,], col = 'orange', lwd = 2)
axis(side = 1, labels = labels, at = 1:10)
legend(1, y = 0.3,
       legend = c("Training error", "Testing error"),
       col = c('blue', 'orange'), lty = c(1, 1))
Based on these investigations, it seems that when the number of predictors increases, both the training and testing errors get smaller. The training error, i.e. the average number of wrong predictions in the data used to fit the model, is smaller than the testing error from the 10-fold cross-validation. When the number of predictors gets smaller (below 4), both the testing and training errors increase and move closer together.
Data reference: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. https://archive.ics.uci.edu/ml/datasets/Student+Performance
Today’s exercise will focus on different techniques of clustering and classification. I will use data on housing in areas of Boston and mostly focus on the crime rate in the city. The data can be accessed through the R package MASS. The data contain area-level information on the characteristics of homes (size, value etc.) and the demographic composition of the areas, as well as several variables related to environmental and infrastructural factors. More information on the data is available here and in the original study by Harrison & Rubinfeld (1978).
By the way, I decided to hide my code chunks by default in this course diary, as they make reading a bit tedious. You should be able to see the code by clicking the Code button next to the results.
library(MASS)
library(tidyverse)
library(GGally)
library(corrplot)
data(Boston)
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
hist(Boston$crim)
The data include 14 variables from some 500 areas in Boston, all of them numeric. My main variable of interest, the crime rate, seems highly skewed: most of the areas have low rates of crime and a few express much higher rates.
p.values.mat <- cor.mtest(Boston, conf.level = .95)
cor.mat <- cor(Boston)
corrplot.mixed(cor.mat,
               lower.col = 'black',
               upper = 'color',
               tl.col = 'black',
               tl.cex = 0.5,
               number.cex = 0.5,
               p.mat = p.values.mat$p,
               sig.level = 0.05)
Correlation plot
For a graphical overview of the data, I am using a correlation plot, which keeps the plot to some extent readable (compared to pairs or ggpairs). In the plot, I have crossed out all the correlations not significant at the 95% confidence level. The variable chas is not significantly correlated with most of the variables (probably because it is binary). The variable indicates whether the area is bounded by the Charles River, which makes little sense to me anyhow.
Most of the other variables are significantly correlated, and some correlations are relatively strong: for instance, property tax rate (tax) and access to radial highways (rad) have a correlation of 0.91, age of houses (age) and distance to employment centres (dis) have a correlation of -0.75, and nitrogen oxide concentration (nox) and the proportion of non-retail business (indus) have a correlation of 0.76. These are rather intuitive. Most of the correlations seem moderate, between 0.3 and 0.6. The highest correlations with the crime rate occur for access to radial highways and property tax rate.
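The claim about the crime rate’s strongest correlates can also be read directly from the correlation matrix:

```r
library(MASS)
data(Boston)
# Correlations of all variables with the crime rate,
# strongest positive ones last (rad and tax top the list)
sort(cor(Boston)[, "crim"])
```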
As the next part of the assignment, I am running a Linear Discriminant Analysis (LDA). LDA classifies observations into pre-defined categories based on linear combinations of the variables in the data. The model assumes that the variables are normally distributed and that they share the same covariance across classes. Thus, the process starts by scaling the data. The effect of scaling can be seen in the following summary: all the variables are now centered around their mean, which is 0.
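Standardizing with scale() is equivalent to subtracting the column mean and dividing by the column standard deviation; a minimal illustration with a single vector:

```r
# scale() computes (x - mean(x)) / sd(x) column by column
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z)
```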
#Scale boston data
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Next, I will calculate target classes for the LDA model: I take the crime rate of the areas and divide it into quartiles. Then, I divide the data into train and test sets by randomly sampling 80% of the areas into the training set; the remaining areas form the test set. Lastly, I fit the LDA model on the training data and test it by predicting the classes of the test set.
#Save quantiles of crime rate
bins <- quantile(boston_scaled$crim)
#Create new categorical crime variable
crime <- cut(boston_scaled$crim,
             breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
boston_scaled$crime <- crime
#Divide data into test and training sets
n <- nrow(boston_scaled)
#Randomly sample 80% of the original rows
#These are used for training
ind <- sample(n, size=n*0.8)
#Train set
train <- boston_scaled[ind,]
#Test set
test <- boston_scaled[-ind,]
#Correct classes in the test set
correct <- test$crime
#Drop crime from test
test <- dplyr::select(test, -crime)
#LDA model
lda.fit <-
lda(crime~., data=train)
# the function for lda biplot arrows
#(Stolen from Datacamp)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# predict classes with test data
lda.pred <- predict(lda.fit,
newdata = test)
Let’s have a look at the results.
# plot the lda results
plot(lda.fit, dimen = 2,
col=classes,
pch=classes)
lda.arrows(lda.fit, myscale = 2)
#summary of the model
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2549505 0.2549505 0.2301980
##
## Group means:
## zn indus chas nox rm age
## low 0.97425764 -0.9238894 -0.159840490 -0.8909298 0.3940558 -0.8910670
## med_low -0.06533749 -0.3495082 -0.004759149 -0.5854509 -0.1266602 -0.3574322
## med_high -0.36984842 0.1672645 0.109913674 0.3394447 0.2133584 0.4279554
## high -0.48724019 1.0173143 -0.060657012 1.0729111 -0.3750647 0.8201536
## dis rad tax ptratio black lstat
## low 0.8970000 -0.6810822 -0.7500341 -0.37010071 0.38130833 -0.75171116
## med_low 0.4183290 -0.5358646 -0.4911728 -0.09785201 0.36220572 -0.14217722
## med_high -0.4007176 -0.4165578 -0.3193919 -0.27678429 0.09517382 -0.05039298
## high -0.8336634 1.6361396 1.5126335 0.77846144 -0.66909921 0.86691596
## medv
## low 0.49638330
## med_low 0.01258411
## med_high 0.25379554
## high -0.63829230
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08266125 0.57265734 -1.03898336
## indus 0.06076957 -0.33109926 0.17129115
## chas -0.08357570 0.02233658 0.18999452
## nox 0.36515510 -0.78642298 -1.22127540
## rm -0.12574264 -0.12835791 -0.15686805
## age 0.25640497 -0.35565920 -0.11522901
## dis -0.04658707 -0.16942276 0.29618209
## rad 3.10017827 1.01642852 -0.02977765
## tax 0.07246627 0.03044068 0.58900645
## ptratio 0.07149625 0.01344685 -0.40525274
## black -0.10788396 0.02507596 0.20287249
## lstat 0.22105335 -0.13167025 0.46167653
## medv 0.19430092 -0.37498807 -0.16441355
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9418 0.0437 0.0145
# cross tabulate
#the correct classes vs. predictions
table(correct = correct,
predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 9 2 0
## med_low 3 12 8 0
## med_high 0 10 12 1
## high 0 0 0 34
table(correct = correct,
predicted = lda.pred$class) %>%
prop.table(2) %>% round(digits=2)
## predicted
## correct low med_low med_high high
## low 0.79 0.29 0.09 0.00
## med_low 0.21 0.39 0.36 0.00
## med_high 0.00 0.32 0.55 0.03
## high 0.00 0.00 0.00 0.97
The most relevant information is included in the scatterplot, aka biplot, of the first two linear discriminants. The plot shows that the grouping of the high crime rate areas with these linear discriminants was relatively successful, whereas the other groups tend to overlap more. The arrows in the plot show the importance of each variable and the direction in which it works. It seems that accessibility to radial highways (rad) separates the high crime rate group relatively well. The summary of the model shows the same: the mean of rad is high in the high crime rate group and lower in the others. The rad variable also has a large coefficient for LD1. These findings indicate that crime occurs in places with easy access to highways, or, rather, in places where people come and go (probably city centres and various knots in the public transportation system).
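The role of rad can be double-checked directly from the data. A minimal standalone sketch (recomputing the crime quartiles from the full scaled Boston data, so the values differ slightly from the train-set group means above):

```r
library(MASS)
#Rescale the full Boston data and rebuild the crime quartiles
data(Boston)
bs <- as.data.frame(scale(Boston))
crime <- cut(bs$crim, breaks = quantile(bs$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
#Mean scaled accessibility to radial highways per crime class:
#the high crime rate class should stand out clearly
round(tapply(bs$rad, crime, mean), 2)
```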
When we look at the predictions made with the test data, we see, similarly to the biplot, that the model classifies the high crime rate areas mostly correctly but struggles more with the others. This is probably related to the skewness of the crime variable: the lowest three classes are relatively similar to each other, while the high crime rate group is a somewhat special case.
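A single summary number for the confusion matrix is the overall accuracy, i.e. the share of areas on the diagonal. A small sketch that rebuilds the table printed above by hand (the counts come from this particular random split, so they will differ between runs):

```r
#Confusion matrix from the run above (rows = correct, cols = predicted)
conf <- matrix(c(11,  9,  2,  0,
                  3, 12,  8,  0,
                  0, 10, 12,  1,
                  0,  0,  0, 34),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
#Overall accuracy: diagonal counts divided by the total
sum(diag(conf)) / sum(conf)  #about 0.68 for this split
```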
Okay then, let’s move on from classification to clustering. I will run a k-means algorithm to identify clusters in the Boston data. I start by calculating the Euclidean distances between observations and showing a summary of them, since the k-means algorithm relies on these distances. Second, I run the k-means algorithm for different numbers of clusters, ranging from 1 to 15, and calculate the total within-cluster sum of squares (TWCSS) after each iteration. Usually, a suitable number of clusters is where the TWCSS shows the largest drop.
#Reload and rescale boston
data(Boston)
boston_scaled <-
as.data.frame(scale(Boston))
#Calculate distances between observations
#I use Euclidean for no specific reason
#except that the km algorithm uses it
#by default
distances <- dist(boston_scaled)
summary(distances)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Identify correct number of clusters
#Use the TWCSS for this purpose
k_max <- 15 #Arbitrary number
twcss <-
sapply(1:k_max,
function(k){
kmeans(
boston_scaled,k)$tot.withinss})
plot(x=1:k_max,y=twcss,type='l')
#2 seems appropriate
According to the plot, two seems to be the correct number of clusters. I will select that and repeat the k-means algorithm with 2 clusters.
#Run k-means algorithm
km <- kmeans(boston_scaled,centers=2)
#Plot the data set
pairs(boston_scaled[
c(1,5,6,7,8,12,13,14)],
col=km$cluster)
Above, I have selected some variables that seem to show reasonable patterning in the data. The scatter plots show the differences between the two clusters identified with k-means. Indeed, the clusters appear more or less feasibly separated on each of the variables.
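The pairs plot can be backed up numerically by looking at the cluster means of each variable. A standalone sketch (with an arbitrary seed, since the cluster labels 1/2 are otherwise free to swap between runs):

```r
library(MASS)
data(Boston)
bs <- as.data.frame(scale(Boston))
set.seed(123)  #arbitrary seed; cluster labels may swap without it
km_fit <- kmeans(bs, centers = 2)
#Mean of each scaled variable within the clusters: variables with
#large between-cluster differences drive the separation
round(aggregate(bs, by = list(cluster = km_fit$cluster), FUN = mean), 2)
```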
Now, I will combine the k-means algorithm with the LDA. I first identify three target classes with k-means to be used in the LDA, and then fit the model.
km2 <- kmeans(boston_scaled,centers=3)
boston_scaled$km_clust <- km2$cluster
#LDA model (rename to avoid confusion
#in the next step)
lda.fit2 <-
lda(km_clust~., data=boston_scaled)
lda.fit2
## Call:
## lda(km_clust ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2094862 0.3241107 0.4664032
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.4075892 1.5146367 -1.068814 -0.04947434 -0.9962503 0.92400834
## 2 0.8046456 -0.4872402 1.117990 0.01575144 1.1253988 -0.46443119
## 3 -0.3760908 -0.3417123 -0.296848 0.01127561 -0.3345884 -0.09228038
## age dis rad tax ptratio black
## 1 -1.16762641 1.19486951 -0.5983266 -0.6616391 -0.74378342 0.3538816
## 2 0.79737580 -0.85425848 1.2219249 1.2954050 0.60580719 -0.6407268
## 3 -0.02966623 0.05695857 -0.5803944 -0.6030198 -0.08691245 0.2863040
## lstat medv
## 1 -0.9480974 0.97889973
## 2 0.8719904 -0.68418954
## 3 -0.1801190 0.03577844
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.03134296 -0.14880455
## zn -0.06381527 -1.22350515
## indus 0.61086696 -0.10402980
## chas 0.01953161 0.03579238
## nox 1.00230143 -0.70464917
## rm -0.16285767 -0.44390394
## age -0.07220634 0.59785382
## dis -0.04270475 -0.45498614
## rad 0.71987743 -0.02882054
## tax 0.98285440 -0.70663319
## ptratio 0.22527977 -0.15514668
## black -0.01693595 0.03181845
## lstat 0.18274033 -0.50122677
## medv -0.02892966 -0.64244841
##
## Proportion of trace:
## LD1 LD2
## 0.8409 0.1591
classes <-
as.numeric(boston_scaled$km_clust)
# plot the lda results
plot(lda.fit2, dimen = 2,
col=classes,
pch=classes)
lda.arrows(lda.fit2, myscale = 2)
As can be seen from the biplot, the three identified clusters work relatively nicely with the LDA. We can see three clusters separated relatively well by the variables. The most important separators seem to be access to radial highways, the age of the building stock and the property tax rate. The tax rate operates in the same direction as highway access, indicating that there might be more expensive housing near good travelling options. Old houses seem to form their own cluster.
As the last part of the analysis, I am comparing the original LDA with the k-means solution with two clusters. The plots project the data points based on the LDA, rather than the actual observations. In the first plot, the colors represent the four original classes, and in the second, the two k-means clusters.
model_predictors <-
dplyr::select(train, -crime)
# matrix multiplication
matrix_product <-
as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color=train$crime)
#let's fit still another k-means
#For some reason, the code does not work
#with the original train data
#However, this just reloads the Boston data
#and limits it to the same areas, so no
#errors here, I guess
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))
train <- boston_scaled[ind,]
km3 <- kmeans(train, centers=2)
plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color=km3$cluster)
We can see from the plots that the original crime rate is reflected in the k-means clusters: those with a high or medium-high crime rate form one cluster, and those with lower rates another. Using the k-means clusters as target classes is likely to produce a better model than the original solution with four target groups.
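The correspondence between the two clusters and the four crime classes can also be checked with a cross-tabulation. A standalone sketch that rebuilds the train set (the seed is hypothetical, so the split and the cluster labels differ from the ones used above):

```r
library(MASS)
data(Boston)
bs <- as.data.frame(scale(Boston))
crime <- cut(bs$crim, breaks = quantile(bs$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
set.seed(42)  #hypothetical seed, not the split used above
ind <- sample(nrow(bs), size = nrow(bs) * 0.8)
km_fit <- kmeans(bs[ind, ], centers = 2)
#One cluster should collect mostly the high/med_high areas,
#the other the low-crime areas
table(cluster = km_fit$cluster, crime = crime[ind])
```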
Harrison, D. & Rubinfeld, D.L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81-102.
The fifth week of the course focuses on different dimension reduction techniques. I will be using Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA). For practicing these methods, I am doing a few tasks with the United Nations Development Programme data from Human Development Index (HDI) and Gender Inequality Index (GII) databases. The data covers countries across the globe. Metadata are available here.
library(openxlsx)
setwd("~/IODS-project/data/")
human <- read.xlsx("human.xlsx")
library(GGally)
library(tidyverse)
library(corrplot)
summary(human)
## edu_ratio lab_ratio exp_edu life_exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni maternal_mx teen_births female_parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(human,
        upper = list(continuous = wrap("cor", size = 3)))
corrplot(cor(human))
From the summary above, we can see that all the variables in the data are continuous. We have information on demographic characteristics, including life expectancy (life_exp), maternal mortality (maternal_mx) and teen birth rate (teen_births). Socioeconomic indicators related to gender equality include the ratio of the female labour force rate to the male labour force rate (lab_ratio) and the ratio of the rate of secondary educated females to the rate of secondary educated males (edu_ratio). Lastly, we have information on the proportion of females in parliament (female_parl), the gross national income (gni) and the expected number of years of education (exp_edu). The data are available for 195 countries.
We can start to identify associations between these variables from the ggpairs output and the correlation plot above. First, we see that the demographic indicators (overall life expectancy, maternal mortality rate and teen birth rate) are strongly correlated with each other. These are also correlated with the GNI, the expected number of years of education and gender equality in education. Life expectancy is correlated positively with these indicators, whereas maternal mortality and teen births correlate negatively. Gender equality in the labour force or in parliament does not seem to have strong correlations with any of the variables.
Given that the data used describes multiple aspects of societies, identifying bivariate associations is somewhat uninteresting. Therefore, I will perform a PCA to identify whether the indicators presented above belong to same dimensions and if the dimensions have meaningful relationships between each other. I will start by running the analysis on unmodified data.
#PCA
pca_human <- prcomp(human)
#summary
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
#Save proportions of variance explained
proportions <-
round(
summary(pca_human)$importance[2,]*100,
digits=3)
pc_lab <-
paste0(
names(proportions)," (", proportions, "%)")
#biplot
biplot(pca_human,
choices=1:2,
cex=c(0.5,0.4),
xlab=pc_lab[1],
ylab=pc_lab[2])
PCA with unstandardized data. GNI explains all of the variability
Okay then, from the summary of the model we see that the model identified 8 principal components (which is the number of the variables), but the first of these explains 99.99%, in effect 100%, of the variation in the data. If we look at the biplot, we can see that the only important component seems to be the gross national income (I guess many politicians use this type of PCA in their reasoning). The fact that GNI overrides all the other variables stems from the unmodified data: the variables have very different variances, and PCA treats the variable with the largest variance as the most important one. Therefore, to actually identify the real dimensions, I need to scale the data and run the analysis again.
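The effect of unequal variances is easy to demonstrate with a toy data set: a synthetic variable with a GNI-like variance swallows the whole first component until the data are scaled. The variable names here are made up for illustration.

```r
set.seed(1)
toy <- data.frame(big   = rnorm(100, sd = 1000),  #think: GNI
                  small = rnorm(100, sd = 1))     #think: a ratio variable
#Proportion of variance explained by each PC, unscaled vs. scaled
round(summary(prcomp(toy))$importance[2, ], 3)         #PC1 takes ~everything
round(summary(prcomp(scale(toy)))$importance[2, ], 3)  #roughly an even split
```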
#scale data
human_sc <- scale(human)
pca_human_sc <- prcomp(human_sc)
summary(pca_human_sc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
proportions <-
round(
summary(pca_human_sc)$importance[2,]*100,
digits=3)
pc_lab <-
paste0(
names(proportions)," (", proportions, "%)")
biplot(pca_human_sc,
choices=1:2,
cex=c(0.5,0.4),
xlab=pc_lab[1],
ylab=pc_lab[2])
PCA with scaled data. First principal component measuring overall economic and vital well-being, second gender equality
Now, a much more interesting plot is produced. We can see that the sociodemographic indicators, i.e. education, GNI, life expectancy, maternal mortality and teen births, load onto the first principal component. That component explains over 50% of the total variability in the data. We also see that maternal mortality and teen births operate in the opposite direction compared to the other factors. This makes sense, as it would be strange if, for instance, GNI increased with increasing rates of maternal mortality and teen births. These correlations were already identified above in the graphical overview step.
Second, the new PCA produced another distinct principal component, which seems to describe gender equality. The gender ratio in the labour market and the proportion of females in parliament relate to this component. This dimension seems to be genuinely distinct from the first, as the arrows of the variables related to this component are at an almost 90 degree angle (meaning low correlation) to the indicators influencing dimension one. I might interpret this to indicate that gender equality in the labour market and parliament is not related to economic and “vital” well-being in a society. Instead, other factors (maybe values, attitudes etc.) are at play. A surprising thing is that gender equality in education does not seem to belong to the gender equality component. However, this is probably because the variable only includes information on secondary education. It might be that for overall well-being to increase, it is necessary to have a population where each member has at least some education. Differences might emerge if tertiary education were used as the measure of education.
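Instead of eyeballing the biplot arrows, the loadings can be read directly from the rotation matrix of the prcomp object. Since the human data comes from a local file, here is a sketch of the same move on R's built-in USArrests data as a stand-in:

```r
#PCA on scaled data; rotation holds the variable loadings
pca <- prcomp(scale(USArrests))
#Loadings on the first two components: the sign gives the direction
#of the biplot arrow, the magnitude its importance
round(pca$rotation[, 1:2], 2)
```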
As the second part of this assignment, I am applying MCA to a pre-existing dataset called tea, available in the R package FactoMineR. The data contain information on the tea drinking habits of 300 individuals. I chose the first 6 variables from the data set, which measure when these people drink tea, and try to identify similar components as in the PCA, but this time for categorical data.
library(FactoMineR)
data(tea)
tea_plc_time <-
tea[,1:6]
mca <- MCA(tea_plc_time,graph=F)
plot(mca,habillage="quali",invisible=c("ind"))
The MCA calculates distances between the variables in a multi-dimensional space (I think, at least). In the plot above, the distances between the variables along the first two dimensions are plotted. We can see that variable categories opposite to each other (no/yes) are plotted in opposite quadrants of the plot. Second, we see that similar variables are plotted close to each other (for instance, not breakfast and not tea time). Third, the variable categories that are well captured by the dimensions occur further from the center of the plot than the others. We can clearly see that especially dinner and lunch seem to be well distinguished. We can also confirm this by looking at the bar plots of the variables: there is clearly a relatively small group of lunch or dinner drinkers.
data(tea)
gather(tea[,1:6]) %>%
ggplot(aes(value)) +
facet_wrap("key",scales="free") +
geom_bar()
Let’s continue the analysis by adding the individuals to the plot:
plot(mca,habillage="quali")
Now, the upper left corner contains the individuals whose tea drinking habits are characterized by drinking tea during lunch and in the evenings. In the upper right corner we have those individuals who apparently do not drink tea at all (see also the plot above to better identify the categories). The lower right corner represents tea drinkers who limit their consumption to dinner time. Finally, the lower left corner includes individuals who want to preserve their good night’s sleep and only drink tea in the mornings and during tea time. The edgiest group seems to be those drinking with dinner, as they do not tolerate drinking tea at any other time (given that the interpretation is correct, which it may well not be).